
    Near-Memory Address Translation

    Memory and logic integration on the same chip is becoming increasingly cost-effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation, MPUs have severely limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that limiting the associativity of the virtual-to-physical mapping incurs no penalty and, when combined with careful data placement in the MPU's memory, breaks the translate-then-fetch serialization, allowing translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages, respectively. Comment: 15 pages, 9 figures.
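    To make the limited-associativity idea concrete, the following is a minimal software model of a set-associative virtual-to-physical mapping, assuming invented parameters (page size, set count, associativity) rather than DIPTA's actual organization; it only illustrates why the candidate frames are known before translation finishes, so the data fetch can start in parallel.

```python
# Illustrative model of limited-associativity virtual memory (not DIPTA's hardware).
# With set-associative placement, the candidate frames of a virtual page are fixed
# by its page number alone, so the memory side can begin fetching from them while
# the translation (picking the exact way) completes. Parameters are assumptions.

PAGE_SIZE = 4096
NUM_SETS  = 1 << 16   # assumed number of sets in physical memory
WAYS      = 4         # assumed associativity of the virtual-to-physical mapping

def candidate_frames(vpn):
    """A virtual page may only map to WAYS frames, all in set (vpn % NUM_SETS)."""
    s = vpn % NUM_SETS
    return [s * WAYS + w for w in range(WAYS)]

def access(vaddr, inverted_table):
    """inverted_table: dict mapping VPN -> way; in DIPTA this lookup is a
    near-memory structure resolved concurrently with the data fetch."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    frames = candidate_frames(vpn)            # known without any translation
    way = inverted_table[vpn]                 # resolved in parallel with the fetch
    return frames[way] * PAGE_SIZE + offset   # physical address actually used
```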

    Near-Memory Address Translation

    Virtual memory (VM) is a crucial abstraction in modern computer systems at any scale, from handheld devices to datacenters. VM provides programmers the illusion of an always sufficiently large and linear memory, making programming easier. Although the core components of VM have remained largely unchanged since early VM designs, the design constraints and usage patterns of VM have radically shifted from when it was invented. Today, computer systems integrate hundreds of gigabytes to a few terabytes of memory, while tightly integrated heterogeneous computing platforms (e.g., CPUs, GPUs, FPGAs) are becoming increasingly ubiquitous. As there is a clear trend towards extending the CPU's VM to all computing elements in the system for an efficient and easy-to-use programming model, the continuous demand for faster memory accesses calls for fast translations to terabytes of memory for any computing element in the system. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of today's translation caches, such as TLBs. In this thesis, we provide fundamental insights into why address translation sits on the critical path of accessing memory. We observe that the traditional fully associative flexibility to map any virtual page to any page frame precludes accessing memory before translating. We study the associativity of VM across a variety of scenarios by classifying page faults using the 3C model developed for caches. Our study demonstrates that the full associativity of VM is unnecessary, and only modest associativity is required. We conclude that capacity and compulsory misses---which are unaffected by associativity---dominate, while conflict misses rapidly disappear as the associativity of VM increases. Building on the modest associativity requirements, we propose a distributed memory management unit close to where the data resides to reduce or eliminate the TLB miss penalty.
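    As a sketch of how page faults can be classified with the 3C model the thesis borrows from caches, the following simulates a page-reference trace under an assumed memory size and associativity, labeling each miss compulsory, capacity, or conflict; the parameters and LRU replacement are illustrative assumptions, not the thesis' methodology in detail.

```python
# Minimal 3C classification of page faults: compulsory = first touch, capacity =
# also misses in a fully associative memory of the same size, conflict = would
# have hit with full associativity. LRU replacement is assumed for simplicity.
from collections import OrderedDict

def classify_faults(trace, num_frames, ways):
    sets = num_frames // ways
    seen = set()                                   # pages touched at least once
    fa = OrderedDict()                             # fully associative LRU memory
    sa = [OrderedDict() for _ in range(sets)]      # set-associative LRU memory
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for vpn in trace:
        hit_sa, hit_fa = vpn in sa[vpn % sets], vpn in fa
        if hit_sa:
            counts["hit"] += 1
        elif vpn not in seen:
            counts["compulsory"] += 1              # no associativity avoids a first touch
        elif not hit_fa:
            counts["capacity"] += 1                # misses even with full associativity
        else:
            counts["conflict"] += 1                # disappears as associativity grows
        seen.add(vpn)
        fa[vpn] = None; fa.move_to_end(vpn)
        if len(fa) > num_frames:
            fa.popitem(last=False)
        s = sa[vpn % sets]
        s[vpn] = None; s.move_to_end(vpn)
        if len(s) > ways:
            s.popitem(last=False)
    return counts
```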

    Prebúsqueda hardware en aplicaciones comerciales (Hardware Prefetching in Commercial Applications)

    Prefetching techniques try to bridge the gap between memory access latency and processor cycle time. This gap, known as the "Memory Gap" or "Memory Wall", spans two orders of magnitude and keeps growing. Together with thermal and power constraints, it is the main limitation on increasing the performance of today's processors. Recently proposed techniques exploit the fact that sequences of memory references repeat over time, and in the same order. The Achilles' heel of these prefetchers is their inability to predict memory accesses to addresses that have never been visited before; this opportunity cost is magnified in applications where a large majority of the program's data set is read only once. Another class of techniques builds on the observation that programs access the address space through common patterns aligned to memory regions, which can be predicted by correlating on code. One of the most important limitations of these prefetchers is that a first access to a region is needed before prediction of the blocks to be referenced within it can begin, a cost that cannot be amortized by correctly prefetched blocks beyond the region size, which is finite. Another is the triggering mechanism itself: because the prediction is made for the entire region, a single incorrect prediction forfeits every opportunity to predict correctly within that region. The state of the art in data prefetching temporally correlates the accesses that trigger predictions within regions. This final-year project exposes the inefficiencies of the most advanced prefetchers and proposes two techniques to address them: (1) prediction by delta-address sequence, which triggers predictions using the last sequence of deltas observed in the memory reference stream, and (2) prediction per access, where a prediction is made every time the processor sends a data request to the memory hierarchy. The results show that delta correlation can predict recurring access patterns to memory addresses never referenced before, improving temporal prediction of memory accesses, and that per-access predictions improve on region-based prefetchers by removing the opportunity cost of incorrect predictions within a region.
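    A minimal sketch of the delta-sequence idea, with invented table sizes and prefetch depth rather than the project's actual design, is shown below; it learns which delta tends to follow a short history of deltas and then issues a prediction on every access, so recurring patterns can be prefetched even at addresses never seen before.

```python
# Delta-correlation prefetcher sketch (illustrative parameters, not the project's design).
from collections import deque

HISTORY = 3      # number of recent deltas used as prediction context
DEGREE  = 4      # how far ahead predictions are chained on each access

class DeltaPrefetcher:
    def __init__(self):
        self.table = {}                       # delta-history tuple -> next delta
        self.deltas = deque(maxlen=HISTORY)
        self.last_addr = None

    def access(self, addr):
        """Called on every memory request; returns predicted addresses to prefetch."""
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if len(self.deltas) == HISTORY:
                self.table[tuple(self.deltas)] = delta   # learn the recurring pattern
            self.deltas.append(delta)
        self.last_addr = addr
        # per-access prediction: follow the learned delta chain up to DEGREE steps
        ctx, pred_addr, prefetches = deque(self.deltas, maxlen=HISTORY), addr, []
        for _ in range(DEGREE):
            nxt = self.table.get(tuple(ctx)) if len(ctx) == HISTORY else None
            if nxt is None:
                break
            pred_addr += nxt
            prefetches.append(pred_addr)
            ctx.append(nxt)
        return prefetches
```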

    Pigment Content of D1-D2-Cytochrome b559 Reaction Center Preparations after Removal of CP47 Contamination: An Immunological Study

    Isolated D1-D2-cytochrome b559 photosystem II reaction center preparations with a pigment stoichiometry higher than 4 chlorophylls per 2 pheophytins can be contaminated with CP47 proximal antenna complex. Reaction centers prepared by a modification of the Nanba-Satoh procedure and containing about 6 chlorophylls per 2 pheophytins showed immuno-cross-reactivity when probed with a monoclonal antibody raised against the CP47 polypeptide. Furthermore, they could be fractionated successfully by Superose-12 sieve chromatography into two different populations. The first few fractions off the column contained a more definitive 435 nm shoulder corresponding to increased chlorophyll content, and showed strong immuno-cross-reactivity with the CP47 antibody. The peak fractions off the column displayed a less prominent 435 nm shoulder, and did not cross-react with the antibody. Moreover, when a 6-chlorophyll preparation was mixed with Sepharose beads coupled to CP47 antibody, the eluted material corresponded to a preparation of about 4 chlorophylls per 2 pheophytins and did not show any cross-reaction with the antibody against CP47. The amount of CP47 protein in the 6-chlorophyll preparation, as quantitated using Coomassie Blue staining or from gel blots, was sufficient to account for most of the extra 2 chlorophylls. We conclude that D1-D2-cytochrome b559 preparations containing more than 4 chlorophylls per 2 pheophytins can be contaminated with small amounts of CP47-D1-D2-Cyt b559 complex and that native photosystem II reaction centers contain 4 core chlorophylls per 2 pheophytins. Peer reviewed.

    ExtOS: Data-centric Extensible OS


    Design Guidelines for High-Performance SCM Hierarchies

    With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM's. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask how best to introduce SCM for such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM's read/write latency disparity. We identify the set of memory hierarchy design parameters that play a key role in the performance and cost of a memory system combining an SCM technology and a 3D stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two-bits-per-cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 3% of the best-performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements. Comment: Published at MEMSYS'1
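    To illustrate the flavor of such a provisioning methodology, the sketch below screens candidate DRAM-cache/SCM configurations with a simple average-latency and linear-cost model; every latency, cost, and hit-rate value is an invented placeholder, not a measurement from the paper.

```python
# Toy performance/cost screening for an SCM-plus-DRAM-cache hierarchy.
# All numbers are assumptions for illustration only.

def avg_latency_ns(hit_rate, dram_ns, scm_read_ns, scm_write_ns, write_frac):
    scm_ns = (1 - write_frac) * scm_read_ns + write_frac * scm_write_ns
    return hit_rate * dram_ns + (1 - hit_rate) * (dram_ns + scm_ns)

def cost(dram_cache_gb, scm_gb, dram_per_gb, scm_per_gb):
    return dram_cache_gb * dram_per_gb + scm_gb * scm_per_gb

baseline = avg_latency_ns(1.0, 80, 0, 0, 0.0)      # DRAM-only: every access hits DRAM
candidates = [
    # (cache GB, SCM GB, assumed DRAM-cache hit rate, SCM read ns, SCM write ns)
    (8, 256, 0.90, 300, 1000),
    (16, 256, 0.95, 300, 1000),
    (32, 256, 0.99, 300, 1000),
]
best = None
for cache_gb, scm_gb, hr, rd, wr in candidates:
    lat = avg_latency_ns(hr, 80, rd, wr, write_frac=0.25)
    c = cost(cache_gb, scm_gb, dram_per_gb=8.0, scm_per_gb=2.0)
    if lat <= 1.10 * baseline and (best is None or c < best[0]):
        best = (c, cache_gb, scm_gb, lat)   # cheapest config within the latency target
print(best)
```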

    BuMP: Bulk Memory Access Prediction and Streaming

    With the end of Dennard scaling, server power has emerged as the limiting factor in the quest for more capable datacenters. Without the benefit of supply voltage scaling, it is essential to lower the energy per operation to improve server efficiency. As the industry moves to lean-core server processors, the energy bottleneck is shifting toward main memory as a chief source of server energy consumption in modern datacenters. Maximizing the energy efficiency of today's DRAM chips and interfaces requires amortizing the costly DRAM page activations over multiple row buffer accesses. This work introduces Bulk Memory Access Prediction and Streaming, or BuMP. We make the observation that a significant fraction (59-79%) of all memory accesses fall into DRAM pages with high access density, meaning that the majority of their cache blocks will be accessed within a modest time frame of the first access. Accesses to high-density DRAM pages include not only memory reads in response to load instructions, but also reads stemming from store instructions as well as memory writes upon a dirty LLC eviction. The remaining accesses go to low-density pages with virtually unpredictable reference patterns (e.g., hashed key lookups). BuMP employs a low-cost predictor to identify high-density pages and triggers bulk transfer operations upon the first read or write to the page. In doing so, BuMP enforces high row buffer locality where it is profitable, thereby reducing DRAM energy per access by 23% and improving server throughput by 11% across a wide range of server applications.
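    The sketch below shows one way such a low-cost density predictor could look in software, using a per-PC saturating counter trained by the observed footprint of each page; the table organization, thresholds, and 4KB/64B geometry are assumptions for illustration, not BuMP's actual configuration.

```python
# Illustrative page-density predictor: pages whose past footprints (trained per
# triggering PC) were dense are fetched in bulk on their first access.

BLOCKS_PER_PAGE = 64     # assumed 4KB DRAM page of 64B cache blocks
DENSE_THRESHOLD = 48     # page counts as dense if >= 48 of its 64 blocks are touched

class DensityPredictor:
    def __init__(self):
        self.counters = {}   # triggering PC -> 2-bit saturating confidence counter
        self.live = {}       # open page -> (triggering PC, set of touched blocks)

    def access(self, pc, addr):
        """Returns True if a bulk (whole-page) transfer should be triggered."""
        page = addr >> 12
        block = (addr >> 6) & (BLOCKS_PER_PAGE - 1)
        bulk = False
        if page not in self.live:
            self.live[page] = (pc, set())
            bulk = self.counters.get(pc, 0) >= 2      # predict density on first touch
        self.live[page][1].add(block)
        return bulk

    def close_page(self, page):
        """Train the predictor when the page's access epoch ends (e.g., row closes)."""
        pc, blocks = self.live.pop(page)
        dense = len(blocks) >= DENSE_THRESHOLD
        c = self.counters.get(pc, 0)
        self.counters[pc] = min(3, c + 1) if dense else max(0, c - 1)
```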

    Unlocking Energy

    Locks are a natural place for improving the energy efficiency of software systems. First, concurrent systems are mainstream, and when their threads synchronize, they typically do it with locks. Second, locks are well-defined abstractions, hence changing the algorithm implementing them can be achieved without modifying the system. Third, some locking strategies consume more power than others, thus the strategy choice can have a real effect. Last but not least, as we show in this paper, improving the energy efficiency of locks goes hand in hand with improving their throughput. It is a win-win situation. We make our case for this throughput/energy-efficiency correlation through a series of observations obtained from an exhaustive analysis of the energy efficiency of locks on two modern processors and six software systems: Memcached, MySQL, SQLite, RocksDB, HamsterDB, and Kyoto Cabinet. We propose simple lock-based techniques that improve the energy efficiency of these systems by 33% on average, driven by higher throughput, and without modifying the systems.
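    As one concrete example of how the waiting strategy behind a lock trades power for latency, here is a minimal spin-then-park lock sketch; it is a generic illustration of a strategy choice, not the specific technique proposed in the paper, and the spin bound is an arbitrary assumption.

```python
# Spin-then-park waiting: spin briefly (low handoff latency, burns power), then
# block on a condition variable (frees the core, saves energy on long waits).
import threading

SPIN_LIMIT = 1000   # assumed bound on busy-wait attempts before parking

class SpinThenParkLock:
    def __init__(self):
        self._held = False
        self._cond = threading.Condition()

    def _try_acquire(self):
        with self._cond:
            if not self._held:
                self._held = True
                return True
        return False

    def acquire(self):
        for _ in range(SPIN_LIMIT):       # phase 1: spin (good when waits are short)
            if self._try_acquire():
                return
        with self._cond:                  # phase 2: park (good when waits are long)
            while self._held:
                self._cond.wait()
            self._held = True

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify()
```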

    Meet the Walkers: Accelerating Index Traversals for In-memory Databases

    The explosive growth in digital data and its growing role in real-time decision support motivate the design of high-performance database management systems (DBMSs). Meanwhile, slowdown in supply voltage scaling has stymied improvements in core performance and ushered in an era of power-limited chips. These developments motivate the design of DBMS accelerators that (a) maximize utility by accelerating the dominant operations, and (b) provide flexibility in the choice of DBMS, data layout, and data types. We study data analytics workloads on contemporary in-memory databases and find hash index lookups to be the largest single contributor to the overall execution time. The critical path in hash index lookups consists of ALU-intensive key hashing followed by pointer chasing through a node list. Based on these observations, we introduce Widx, an on-chip accelerator for database hash index lookups, which achieves both high performance and flexibility by (1) decoupling key hashing from the list traversal, and (2) processing multiple keys in parallel on a set of programmable walker units. Widx reduces design cost and complexity through its tight integration with a conventional core, thus eliminating the need for a dedicated TLB and cache. An evaluation of Widx on a set of modern data analytics workloads (TPC-H, TPC-DS) using full-system simulation shows an average speedup of 3.1x over an aggressive OoO core on bulk hash table operations, while reducing the OoO core energy by 83%.
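    The snippet below is a software sketch of the two ideas named in the abstract, hashing decoupled from traversal and multiple keys in flight, using an invented bucket-chain index layout; it only mimics in Python what Widx's programmable walker units do in hardware.

```python
# Decoupled, multi-key hash index lookup sketch: hash a whole batch of keys first,
# then interleave the bucket-chain walks so one key's pointer chase overlaps another's.

def hash_key(key, num_buckets):
    return hash(key) % num_buckets            # stand-in for the ALU-heavy hashing stage

def batched_lookup(index, keys):
    """index: list of bucket chains, each a list of (key, value) nodes."""
    num_buckets = len(index)
    buckets = [hash_key(k, num_buckets) for k in keys]   # stage 1: hash up front
    cursors = [0] * len(keys)
    results = [None] * len(keys)
    pending = set(range(len(keys)))
    while pending:                             # stage 2: round-robin chain walking,
        for i in list(pending):                # mimicking parallel walker units
            chain = index[buckets[i]]
            if cursors[i] >= len(chain):
                pending.discard(i)             # key not present in its bucket
                continue
            k, v = chain[cursors[i]]
            cursors[i] += 1
            if k == keys[i]:
                results[i] = v
                pending.discard(i)
    return results
```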